8 research outputs found

    Source Code Retrieval from Large Software Libraries for Automatic Bug Localization

    This dissertation advances the state-of-the-art in information retrieval (IR) based approaches to automatic bug localization in software. In an IR-based approach, one first creates a search engine using a probabilistic or a deterministic model for the files in a software library. Subsequently, a bug report is treated as a query to the search engine for retrieving the files relevant to the bug. With regard to the new work presented, we first demonstrate the importance of taking version histories of the files into account for achieving significant improvements in the precision with which the files related to a bug are located. This is motivated by the realization that the files that have not changed in a long time are likely to have "stabilized" and are therefore less likely to contain bugs. Subsequently, we look at the difficulties created by the fact that developers frequently use abbreviations and concatenations that are not likely to be familiar to someone trying to locate the files related to a bug. We show how an initial query can be automatically reformulated to include the relevant actual terms in the files by an analysis of the files retrieved in response to the original query for terms that are proximal to the original query terms. The last part of this dissertation generalizes our term-proximity based work by using Markov Random Fields (MRF) to model the inter-term dependencies in a query vis-a-vis the files. Our MRF work redresses one of the major defects of the most commonly used modeling approaches in IR, which is the loss of all inter-term relationships in the documents.

    Exploiting Spatial Code Proximity and Order for Improved Source Code Retrieval for Bug Localization

    Practically all Information Retrieval (IR) based approaches developed to date for automatic bug localization are based on the bag-of-words assumption that ignores any positional and ordering relationships between the terms in a query. In this paper we argue that bug reports are ill-served by this assumption since such reports frequently contain various types of structural information whose terms must obey certain positional and ordering constraints. It therefore stands to reason that the quality of retrieval for bug localization would improve if these constraints could be taken into account when searching for the most relevant files. In this paper, we demonstrate that such is indeed the case. We show how the well-known Markov Random Field (MRF) based retrieval framework can be used for taking into account the term-term proximity and ordering relationships in a query vis-a-vis the same relationships in the files of a source-code library to greatly improve the quality of retrieval of the most relevant source files. We have carried out our experimental evaluations on popular large software projects using over 4 thousand bug reports. The results we present demonstrate unequivocally that the new proposed approach is far superior to the widely used bag-of-words based approaches.
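As a rough illustration of how proximity and ordering can enter a retrieval score, the sketch below combines three feature types in the spirit of a sequential-dependence MRF: single terms, exact ordered term pairs, and unordered co-occurrence within a window. It is a simplified sketch and not the paper's model: the function name `sdm_score`, the weights, the window size, and the use of log(1 + count) instead of smoothed language-model estimates are all assumptions made for the example.

```python
import math

def sdm_score(query, doc, weights=(0.85, 0.10, 0.05), window=8):
    """Score a tokenized file against a tokenized query (hypothetical helper).

    weights = (unigram, ordered pair, unordered window); assumed values.
    """
    lam_t, lam_o, lam_u = weights
    score = 0.0
    # Unigram features: the bag-of-words part of the score.
    for term in query:
        score += lam_t * math.log1p(doc.count(term))
    # Pair features over adjacent query terms.
    for a, b in zip(query, query[1:]):
        # Ordered: a immediately followed by b in the file.
        ordered = sum(
            1 for i in range(len(doc) - 1) if doc[i] == a and doc[i + 1] == b
        )
        # Unordered: occurrences of a with b less than `window` tokens away.
        unordered = sum(
            1
            for i, tok in enumerate(doc)
            if tok == a
            and any(
                doc[j] == b
                for j in range(max(0, i - window + 1), min(len(doc), i + window))
                if j != i
            )
        )
        score += lam_o * math.log1p(ordered) + lam_u * math.log1p(unordered)
    return score

query = ["null", "pointer"]
doc_a = "null pointer dereference in parser".split()
doc_b = "pointer to parser state is null".split()
score_a = sdm_score(query, doc_a)
score_b = sdm_score(query, doc_b)
```

Both example files contain each query term exactly once, so a pure bag-of-words score (weights `(1, 0, 0)`) cannot separate them; the ordered-pair feature is what ranks `doc_a` higher.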

    AutoBlock: A Hands-off Blocking Framework for Entity Matching

    Entity matching seeks to identify data records over one or multiple data sources that refer to the same real-world entity. Virtually every entity matching task on large datasets requires blocking, a step that reduces the number of record pairs to be matched. However, most of the traditional blocking methods are learning-free and key-based, and their successes are largely built on laborious human effort in cleaning data and designing blocking keys. In this paper, we propose AutoBlock, a novel hands-off blocking framework for entity matching, based on similarity-preserving representation learning and nearest neighbor search. Our contributions include: (a) Automation: AutoBlock frees users from laborious data cleaning and blocking key tuning. (b) Scalability: AutoBlock has a sub-quadratic total time complexity and can be easily deployed for millions of records. (c) Effectiveness: AutoBlock outperforms a wide range of competitive baselines on multiple large-scale, real-world datasets, especially when datasets are dirty and/or unstructured. Comment: In The Thirteenth ACM International Conference on Web Search and Data Mining (WSDM '20), February 3-7, 2020, Houston, TX, USA. ACM, Anchorage, Alaska, USA, 9 pages.
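The general idea of blocking by nearest neighbor search over record representations can be sketched as follows. This is not AutoBlock itself: the example swaps in hand-built character-trigram count vectors for learned representations and an exhaustive cosine search for an approximate index, so it is quadratic rather than sub-quadratic. The names `trigram_vector`, `cosine`, and `candidate_pairs` are hypothetical.

```python
import math
from collections import Counter

def trigram_vector(text):
    """Represent a record as character-trigram counts (stand-in for a learned embedding)."""
    s = f"  {text.lower()} "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items() if gram in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def candidate_pairs(left, right, k=1):
    """Keep, for each left record, only its k most similar right records."""
    right_vecs = [trigram_vector(r) for r in right]
    pairs = set()
    for i, record in enumerate(left):
        vec = trigram_vector(record)
        ranked = sorted(
            range(len(right)),
            key=lambda j: cosine(vec, right_vecs[j]),
            reverse=True,
        )
        pairs.update((i, j) for j in ranked[:k])
    return pairs

left = ["Apple iPhone 12 64GB", "Sony WH-1000XM4 headphones"]
right = [
    "sony wh1000xm4 wireless headphones",
    "iphone 12 apple 64 gb",
    "Dell XPS 13 laptop",
]
pairs = candidate_pairs(left, right, k=1)
```

With `k=1`, only two of the six possible record pairs survive as candidates for the downstream matcher, and no blocking key had to be designed by hand.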

    Incorporating version histories in information retrieval based bug localization

    Fast and accurate localization of software defects continues to be a difficult problem since defects can emanate from a large variety of sources and can often be intricate in nature. In this paper, we show how version histories of a software project can be used to estimate a prior probability distribution for defect proneness associated with the files in a given version of the project. Subsequently, these priors are used in an IR (Information Retrieval) framework to determine the posterior probability of a file being the cause of a bug. We first present two models to estimate the priors, one from the defect histories and the other from the modification histories, with both types of histories as stored in the versioning tools. Referring to these as the base models, we then extend them by incorporating a temporal decay into the estimation of the priors. We show that by just including the base models, the mean average precision (MAP) for bug localization improves by as much as 30%. And when we also factor in the time decay in the estimates of the priors, the improvements in MAP can be as large as 80%.
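A minimal sketch of the prior-plus-likelihood idea described above: each file receives a prior from its change history, with older changes counting for less, and the prior is combined with a query likelihood to rank files. The exponential form of the decay, the `decay` constant, the function names, and the toy data are assumptions for illustration and are not taken from the paper.

```python
import math

def decayed_priors(change_times, now, decay):
    """Estimate P(file) from timestamps of past changes or fixes.

    change_times maps a file name to a list of timestamps (for example, days).
    Each event contributes exp(-(now - t) / decay); this form is an assumption.
    """
    weights = {
        name: sum(math.exp(-(now - t) / decay) for t in times)
        for name, times in change_times.items()
    }
    total = sum(weights.values())
    if total == 0:
        # No history at all: fall back to a uniform prior.
        return {name: 1.0 / len(weights) for name in weights}
    return {name: w / total for name, w in weights.items()}

def rank_files(log_likelihoods, priors, floor=1e-9):
    """Rank files by log P(query | file) + log P(file)."""
    scores = {
        name: ll + math.log(max(priors.get(name, 0.0), floor))
        for name, ll in log_likelihoods.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy history: a.py changed recently and often, b.py only long ago.
history = {"a.py": [10, 90, 95], "b.py": [5], "c.py": [50]}
priors = decayed_priors(history, now=100, decay=30.0)

# Hypothetical query log-likelihoods from any IR model.
loglik = {"a.py": -12.0, "b.py": -11.5, "c.py": -12.0}
ranking = rank_files(loglik, priors)
```

In this toy case `b.py` has the best text match, yet it drops to last place because its only change is old; setting `decay` very large would approach an undecayed count-based prior.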

    On the use of positional proximity in IR-based feature location

    As software systems continue to grow and evolve, locating code for software maintenance tasks becomes increasingly difficult. Recently proposed approaches to bug localization and feature location have suggested using the positional proximity of words in the source code files and the bug reports to determine the relevance of a file to a query. Two different types of approaches have emerged for incorporating word proximity and order in retrieval: those based on ad-hoc considerations and those based on Markov Random Field (MRF) modeling. In this paper, we explore using both these types of approaches to identify over 200 features in five open source Java systems. In addition, we use positional proximity of query words within natural language (NL) phrases in order to capture the NL semantics of positional proximity. As expected, our results indicate that the power of these approaches varies from one dataset to another. However, the variations are larger for the ad-hoc positional-proximity based approaches than with the approach based on MRF. In other words, the feature location results are more consistent across the datasets with MRF based modeling of the features. Index Terms: feature location, source code search, software maintenance.